Context
Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Among renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across the machines involved in energy generation collect data on environmental factors (temperature, humidity, wind speed, etc.) and on various parts of the wind turbine (gearbox, tower, blades, brake, etc.).
Objective
“ReneWind” is a company working on improving the machinery and processes involved in the production of wind energy using machine learning. It has collected sensor data on generator failures of wind turbines and has shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, 20,000 observations in the training set, and 5,000 in the test set.
The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:
-True positives (TP) are failures correctly predicted by the model. These will result in repair costs.
-False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
-False positives (FP) are detections where there is no failure. These will result in inspection costs.
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of an inspection is less than the cost of a repair.
“1” in the target variable should be considered a “failure”, and “0” represents “no failure”.
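The cost ordering above can be made concrete with a toy calculation. The figures below are hypothetical (the case study only gives the ordering inspection < repair < replacement), but they show why missed failures dominate the bill:

```python
# Hypothetical cost figures (illustrative only; the case study gives no
# actual numbers, only the ordering: inspection < repair < replacement)
INSPECTION_COST = 100       # cost per false positive
REPAIR_COST = 1_000         # cost per true positive (failure caught in time)
REPLACEMENT_COST = 10_000   # cost per false negative (missed failure)

def maintenance_cost(tp, fp, fn):
    """Total maintenance cost implied by a confusion matrix."""
    return tp * REPAIR_COST + fp * INSPECTION_COST + fn * REPLACEMENT_COST

# A model that misses 20 of 100 failures vs. one that catches them all
# at the price of 50 extra false alarms:
print(maintenance_cost(tp=80, fp=10, fn=20))   # 281000
print(maintenance_cost(tp=100, fp=60, fn=0))   # 106000
```

Even with many more inspections, the model that misses no failures is far cheaper, which is why recall will be the metric of interest later on.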
Data Description
-The data provided is a transformed version of the original data, which was collected using sensors.
-Train.csv - To be used for training and tuning of models.
-Test.csv - To be used only for testing the performance of the final best model.
-Both datasets consist of 40 predictor variables and 1 target variable.
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black
!pip install numpy
!pip install scipy
df2 = pd.read_csv("Train.csv.csv")  # training data
df1 = pd.read_csv("Test.csv.csv")  # test data
# Combining both files for the initial exploratory analysis
frames = [df1, df2]
df = pd.concat(frames)
display(df)
The main steps to get an overview of any dataset are the following:
-Observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not.
-Get information about the number of rows and columns in the dataset.
-Find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.
-Check the statistical summary of the dataset to get an overview of the numerical columns of the data.
# Check the first 5 rows of the data
df.head()
So far we have 40 predictors for the target variable, representing different sensor readings used to test whether the machinery will fail or require maintenance in the future. The target variable is 1 for failure and 0 for no failure.
# Check the number of rows and columns in the data
df.shape
The dataframe is composed of 25000 rows and 41 attributes.
# Check the data types of the columns in the dataset
df.info()
Our data is made up of 40 float predictor variables and one integer target variable.
-The first two variables have fewer than 25000 non-null rows; we will further check the best treatment for our missing values.
# Statistical summary of the dataset.
df.describe(include="all").T
# Check for duplicate values in the data
df.duplicated().sum()
There are no duplicated values in our data.
# check for missing values in the data
round(df.isnull().sum() / df.isnull().count() * 100, 2)
Each of our first two variables is missing 0.1% of its values.
data = df.copy()
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    # For histogram
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
for feature in df1.columns:
    histogram_boxplot(df1, feature, figsize=(12, 7), kde=False, bins=None)
-Most of our variables have a roughly normal distribution with little or no skewness.
-The variables V11, V15, V16 and V21 show noticeably higher skewness; we should examine them further and give them an appropriate treatment before assuming normality in our models.
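The visual impression can be backed with a number. A minimal sketch using synthetic stand-in columns (on the real dataframe the call is simply `df.skew()`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in columns: one symmetric, one right-skewed
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=1000),          # skewness near 0
    "right_skewed": rng.exponential(size=1000),  # skewness near 2
})
skews = demo.skew()
flagged = skews[skews.abs() > 1].index.tolist()  # columns worth transforming
print(flagged)
```

A cutoff of |skewness| > 1 is a common rule of thumb for "noticeably skewed"; the exact threshold is a judgment call.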
plt.figure(figsize=(28, 20))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Taking a coefficient above 0.70 (in absolute value) as the threshold for a strong correlation, the most significant correlations among our variables are the following:
-V2 with the variables V14 and V26.
-V3 with V23.
-V6 with the variables V11 and V20.
-V7 with V15.
-V8 with the variables V15 and V23.
-V9 with V16.
-V11 with V29.
-V14 with V38.
-V16 with V21.
-V17 with V27.
-V19 with V24.
-V21 with V35.
-V24 with the variables V27 and V32.
-V25 with V27, V30, V32 and V33.
-V27 with V32.
-V36 with V39.
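The pairs above were read off the heatmap; the same list can be extracted programmatically. A minimal sketch on a small synthetic frame (on the real data, `demo` would be replaced with the sensor dataframe's predictor columns):

```python
import numpy as np
import pandas as pd

# Synthetic frame: V14 is built to be strongly (negatively) correlated
# with V2, while V3 is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "V2": a,
    "V14": -a + 0.1 * rng.normal(size=500),
    "V3": rng.normal(size=500),
})

# List every unordered pair with |r| > 0.70
corr = demo.corr()
pairs = [
    (r, c, round(corr.loc[r, c], 2))
    for i, r in enumerate(corr.columns)
    for c in corr.columns[i + 1:]
    if abs(corr.loc[r, c]) > 0.70
]
print(pairs)
```

This avoids missing a pair when reading a 40x40 heatmap by eye.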
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V2, x=df.V14)
There's a significant negative correlation between variables V2 and V14.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V2, x=df.V26)
There's a significant positive correlation between V2 and V26.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V3, x=df.V23)
There's a significant negative correlation between V3 and V23.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V6, x=df.V11)
There's a significant positive correlation between V6 and V11.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V6, x=df.V20)
There's a significant negative correlation between V6 and V20.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V7, x=df.V15)
There's a significant positive correlation between V7 and V15.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V8, x=df.V15)
There is no significant correlation between V8 and V15.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V8, x=df.V23)
There's a significant positive correlation between V8 and V23.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V9, x=df.V16)
There's a significant negative correlation between V9 and V16.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V11, x=df.V29)
There's a significant positive correlation between V11 and V29.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V14, x=df.V38)
There's a significant negative correlation between V14 and V38.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V16, x=df.V21)
There's a significant positive correlation between V16 and V21.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V17, x=df.V27)
There's a significant negative correlation between V17 and V27.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V21, x=df.V35)
There's a significant negative correlation between V21 and V35.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V24, x=df.V27)
There's a significant negative correlation between V24 and V27.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V24, x=df.V32)
There's a significant positive correlation between V24 and V32.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V25, x=df.V27)
There's a significant positive correlation between V25 and V27.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V25, x=df.V30)
There's a significant negative correlation between V25 and V30.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V25, x=df.V32)
There's a significant negative correlation between V25 and V32.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V25, x=df.V33)
There's a significant negative correlation between V25 and V33.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V27, x=df.V32)
There's a significant positive correlation between V27 and V32.
plt.figure(figsize=(12, 5))
sns.regplot(y=df.V36, x=df.V39)
There's a significant positive correlation between V36 and V39.
After looking at the correlation graphs of several variable pairs, we can conclude that the variables V2, V3, V6, V7, V8, V9, V11, V14, V16, V17, V20, V21, V23, V24, V25, V27, V29, V30, V32, V33, V35, V36, V38 and V39 are significant enough for our initial model.
After several model checks, some of these variables may no longer be considered.
Note: We will be imputing the missing values from our data. For the testing set, we will be using the file "Test.csv.csv" which we previously saved as df1.
## Data preparation
# Separating target variable and other variables
X = df.drop(columns="Target")
Y = df["Target"]
## Splitting data into training and validation sets:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, Y, test_size=0.2, random_state=1, stratify=Y
)
# we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.30, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
imputer = SimpleImputer(strategy="median")
# Fit and transform the train data
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
# Transform the validation data
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_train.columns)
# Transform the test data (using the imputer fitted on the training data)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
print("-" * 30)
There are now no missing values in any of our sets, so we can move on to model building.
The nature of predictions made by the classification model will translate as follows:
-True positives (TP) are failures correctly predicted by the model.
-False negatives (FN) are real failures in a generator where there is no detection by the model.
-False positives (FP) are failure detections in a generator where there is no failure.
Which metric to optimize?
-We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
-We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
-We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.
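A quick toy computation confirms the reasoning: recall = TP / (TP + FN), so each missed failure (false negative) lowers it directly. The labels below are made up for illustration:

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 4 real failures (1s); the model misses one failure
# (a false negative) and raises one false alarm (a false positive)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
```

Accuracy would stay high here (6/8 correct) even if the model missed every failure in a heavily imbalanced set, which is exactly why it is a poor optimization target for this problem.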
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Defining scorer to be used for cross-validation and hyperparameter tuning
We want to reduce false negatives and will try to maximize "Recall".
To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
Model building with original data
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
# Synthetic Minority Over Sampling Technique
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
# Random undersampler for under sampling the data
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Hyperparameter tuning can take a long time to run; to limit that time complexity, you can use the following grids wherever required.
For Gradient Boosting: param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }
For Adaboost: param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }
For Bagging Classifier: param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }
For Random Forest: param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }
For Decision Trees: param_grid = { 'max_depth': np.arange(2,6), 'min_samples_leaf': [1, 4, 7], 'max_leaf_nodes' : [10, 15], 'min_impurity_decrease': [0.0001,0.001] }
For Logistic Regression: param_grid = {'C': np.arange(0.1,1.1,0.1)}
For XGBoost: param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }
# original model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, param_distributions=param_grid, n_iter=10,
    n_jobs=-1, scoring=scorer, cv=5, random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}".format(
    randomized_cv.best_params_, randomized_cv.best_score_))
# oversampled model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, param_distributions=param_grid, n_iter=10,
    n_jobs=-1, scoring=scorer, cv=5, random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}".format(
    randomized_cv.best_params_, randomized_cv.best_score_))
# undersampled model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, param_distributions=param_grid, n_iter=10,
    n_jobs=-1, scoring=scorer, cv=5, random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}".format(
    randomized_cv.best_params_, randomized_cv.best_score_))
# building model with best parameters
dt_tuned_grid = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=1,
    max_leaf_nodes=15,
    min_impurity_decrease=0.0001,
    random_state=1,
)
# Fit the model on training data
dt_tuned_grid.fit(X_train, y_train)
dt_grid_train = model_performance_classification_sklearn(
dt_tuned_grid, X_train, y_train
)
dt_grid_train
dt_grid_val = model_performance_classification_sklearn(
dt_tuned_grid, X_val, y_val
)
dt_grid_val
confusion_matrix_sklearn(dt_tuned_grid, X_train, y_train)
-The tuned dtree model is overfitting the training data.
-The validation recall is greater than 50%, i.e. the model is reasonably good at identifying the likelihood of failure.
# building model with best parameters
dt_tuned = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=2,
    max_leaf_nodes=15,
    min_impurity_decrease=0.0001,
    random_state=1,
)
# Fit the model on training data
dt_tuned.fit(X_train, y_train)
## To check the performance on training set
dt_random_train = model_performance_classification_sklearn(
dt_tuned, X_train, y_train
)
dt_random_train
## To check the performance on validation set
dt_random_val = model_performance_classification_sklearn(
dt_tuned, X_val, y_val
)
dt_random_val
# original model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, param_distributions=param_grid, n_iter=50,
    n_jobs=-1, scoring=scorer, cv=5, random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}".format(
    randomized_cv.best_params_, randomized_cv.best_score_))
# building model with best parameters
tuned_gbm = GradientBoostingClassifier(
max_features=0.7,
random_state=1,
learning_rate=1,
n_estimators=75,
subsample=0.3
)
tuned_gbm.fit(X_train, y_train)
gbm_random_train= model_performance_classification_sklearn(
tuned_gbm, X_train, y_train
)
gbm_random_train
gbm_random_val = model_performance_classification_sklearn(tuned_gbm, X_val, y_val)
gbm_random_val
This model has a reasonably high accuracy of 0.957 and a recall of 0.617, making it one of our best models so far.
# building model with best parameters
tuned_gbm1 = GradientBoostingClassifier(
max_features=0.7,
random_state=1,
learning_rate=1,
n_estimators=50,
subsample=0.2
)
tuned_gbm1.fit(X_train, y_train)
gbm_grid_train = model_performance_classification_sklearn(
tuned_gbm1, X_train, y_train
)
gbm_grid_train
gbm_grid_val = model_performance_classification_sklearn(tuned_gbm1, X_val, y_val)
gbm_grid_val
# original model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, param_distributions=param_grid, n_iter=10,
    n_jobs=-1, scoring=scorer, cv=5, random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}".format(
    randomized_cv.best_params_, randomized_cv.best_score_))
# building model with best parameters
adb_tuned1 = AdaBoostClassifier(
n_estimators=30,
learning_rate=1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)
# Fit the model on training data
adb_tuned1.fit(X_train, y_train)
# Calculating different metrics on train set
Adaboost_grid_train = model_performance_classification_sklearn(
adb_tuned1, X_train, y_train
)
print("Training performance:")
Adaboost_grid_train
# Calculating different metrics on validation set
Adaboost_grid_val = model_performance_classification_sklearn(adb_tuned1, X_val, y_val)
print("Validation performance:")
Adaboost_grid_val
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned1, X_val, y_val)
This is our best model so far, with the fewest false positives and false negatives as well as high scores on every metric.
# building model with best parameters
adb_tuned2 = AdaBoostClassifier(
n_estimators=20,
learning_rate=1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
adb_tuned2.fit(X_train, y_train)
# Calculating different metrics on train set
Adaboost_random_train = model_performance_classification_sklearn(
adb_tuned2, X_train, y_train
)
print("Training performance:")
Adaboost_random_train
# Calculating different metrics on validation set
Adaboost_random_val = model_performance_classification_sklearn(adb_tuned2, X_val, y_val)
print("Validation performance:")
Adaboost_random_val
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2, X_val, y_val)
# training performance comparison
models_train_comp_df = pd.concat(
[
Adaboost_grid_train.T,
Adaboost_random_train.T,
gbm_grid_train.T,
gbm_random_train.T,
dt_grid_train.T,
dt_random_train.T
],
axis=1,
)
models_train_comp_df.columns = [
"AdaBoost Tuned with Grid search",
"AdaBoost Tuned with Random search",
"Gradient Boosting Tuned with Grid search",
"Gradient Boosting Tuned with Random search",
"Decision tree with grid search",
"Decision tree with random search",
]
print("Training performance comparison:")
models_train_comp_df
# Validation performance comparison
models_val_comp_df = pd.concat(
[
Adaboost_grid_val.T,
Adaboost_random_val.T,
gbm_grid_val.T,
gbm_random_val.T,
dt_grid_val.T,
dt_random_val.T
],
axis=1,
)
models_val_comp_df.columns = [
"AdaBoost Tuned with Grid search",
"AdaBoost Tuned with Random search",
"Gradient Boosting Tuned with Grid search",
"Gradient Boosting Tuned with Random search",
"Decision tree with grid search",
"Decision tree with random search",
]
print("Validation performance comparison:")
models_val_comp_df
We conclude that the AdaBoost model tuned with random search is our best available model, considering its accuracy and recall on the validation set, so we select it as the final model.
Let's check our final model on the test set.
# Calculating different metrics on the test set
adaboost_random_test = model_performance_classification_sklearn(adb_tuned2, X_test, y_test)
print("Test performance:")
adaboost_random_test
-The performance on the test set is in line with the validation performance, so the model generalizes well
-Let's check the important features for prediction as per the final model
feature_names = X.columns
importances = adb_tuned2.feature_importances_  # final model: AdaBoost tuned with random search
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Now that we have a final model, let's use pipelines to put the model into production. We know that we can use pipelines to standardize model building, but the steps in a pipeline are applied to each and every variable. How can we customize the pipeline to perform different preprocessing steps on different columns? By using a Column Transformer.

Column Transformer
- A Column Transformer allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer are concatenated to form a single feature space.
- This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
- We will create 2 different pipelines, one for numerical columns and one for categorical columns.
- For numerical columns, we will do missing value imputation as pre-processing.
- For categorical columns, we will do one-hot encoding and missing value imputation as pre-processing.

Note: We will be doing missing value imputation for the whole data so that if there are any missing values in the data in future, they can be taken care of.
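This dataset is fully numeric (ciphered sensor readings), so only the numeric branch is actually built below. For illustration, a categorical branch with imputation plus one-hot encoding would look like the following sketch; the column names `cat1` and `cat2` are hypothetical, as are the numeric columns shown:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical columns -- this dataset has none
categorical_features = ["cat1", "cat2"]

# Impute missing categories with the most frequent value, then one-hot encode;
# handle_unknown="ignore" keeps scoring from failing on unseen categories
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Combine both branches; remainder="passthrough" lets columns not listed
# in either branch flow through the transformer unchanged
preprocessor_example = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))]), ["V15", "V7"]),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="passthrough",
)
```

Each branch's output is concatenated column-wise into a single feature matrix, which is what makes this convenient as the first step of a production pipeline.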
# creating a list of numerical variables
numerical_features = [
"V15",
"V7",
"V18",
"V14",
"V13",
"V35",
"V21"
]
# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
# combining the transformers using a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
    ],
    remainder="passthrough",
)
# remainder="passthrough" allows variables that are present in the original data
# but not in "numerical_features" to pass through the column transformer without any changes
# Separating target variable and other variables
X = df.drop(columns="Target")
Y = df["Target"]
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.20, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
# Creating new pipeline with best parameters
model = Pipeline(
steps=[
("pre", preprocessor),
(
"AdaBoostClassifier",
AdaBoostClassifier(
base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=1, n_estimators=20, random_state=1
),
),
]
)
# Fit the model on training data
model.fit(X_train, y_train)
# Let's check the performance on test set
Model_test = model_performance_classification_sklearn(model, X_test, y_test)
Model_test
So far, our model achieved a 96.8% accuracy rate, a 78.8% precision rate, and a 59% recall rate, making it a reasonably good model for predicting generator failure. A key recommendation for further modeling would be to collect data from a larger base of sensors, so that a more significant sample is available to develop a model with a higher prediction rate.
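To actually put the fitted pipeline into production, it can be serialized to disk and reloaded later for scoring new sensor readings. A minimal sketch using joblib; the pipeline below is a toy stand-in with the same structure (imputer plus AdaBoost) fitted on random data, and the filename is arbitrary:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Stand-in for the notebook's fitted pipeline: same structure, toy data
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 5))
y_toy = rng.integers(0, 2, size=100)
pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("clf", AdaBoostClassifier(random_state=1)),
    ]
)
pipe.fit(X_toy, y_toy)

# Persist the fitted pipeline, then reload it as a production service would
path = os.path.join(tempfile.gettempdir(), "renewind_pipeline.joblib")
joblib.dump(pipe, path)
loaded = joblib.load(path)
preds = loaded.predict(X_toy)  # 0 = no failure, 1 = failure
```

Because preprocessing and the classifier are bundled in one object, the reloaded pipeline applies exactly the same imputation before predicting, which avoids training/serving skew.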